Supervised Learning-Aided Optimization of Expert-Driven Functional Protein Sequence Annotation

Authors

  • Lev Soinov
  • Alexander Kanapin
  • Misha Kapushesky
Abstract

The aim of this work is to use a supervised learning approach to identify sets of motif-based sequence characteristics, combinations of which can give the most accurate annotation of new proteins. We assess several InterPro Consortium member databases for their informativeness for the annotation of full-length protein sequences. Our study thus addresses the problem of integrating biological information from various resources. Decision-rule algorithms are used to cross-map different biological classification systems in order to optimise the process of functional annotation of protein sequences. With the developed approach, various features (e.g., keywords, GO terms, structural complex names) may be assigned to a sequence via its characteristics (e.g., motifs built by various protein sequence analysis methods). We chose SwissProt keywords as the set of features on which to perform our analysis. From the presented results one can quickly obtain the best combinations of methods appropriate for the description of a given class of proteins.

Introduction

The availability of a wide variety of effective protein sequence analysis methods calls for an evaluation of their comparative performance and for the development of approaches to integrated, cross-method consistent annotation. A natural resolution of this problem came with the creation of InterPro [1]. The InterPro database is a single resource collecting sequence pattern data from PROSITE ([2], a repository of regular expressions and profiles), Pfam ([3], based on hidden Markov models), PRINTS ([4], a provider of fingerprints) and several other participating databases. InterPro is a manually curated database in which the curation process is supported by various automated procedures. One of the most straightforward approaches to characterizing a novel sequence is to compare it to proteins already annotated in InterPro.

While this can potentially produce high-quality functional predictions, the motif-focused nature of InterPro complicates the interpretation of such analyses, because in most cases it is impossible to find a single InterPro entry corresponding to the combination of motifs found in a given sequence. On the other hand, there are various systems for direct functional annotation of full-length protein sequences, such as GeneOntology (GO) [5] or SwissProt [6]. The quality of annotation obtained from similarity to InterPro entries/motifs can thus be improved by combining these two annotation paradigms. A correspondence between the annotation specific to a group of previously characterised full-length protein sequences and the domain/repeat architecture of a given sequence could help to achieve a more complete functional description of a protein related …
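The idea in the abstract of finding motif combinations that best predict a SwissProt keyword can be illustrated with a small sketch. It is not the authors' decision-rule algorithm: a brute-force search over motif combinations is swapped in for simplicity, and the motif identifiers, keywords, toy proteins, and function names (`score_rule`, `best_rule`) are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each protein is (set of motif hits, set of keywords).
# Motif identifiers are made up for illustration; they are not real accessions.
PROTEINS = [
    ({"PFAM:kinase_dom", "PROSITE:atp_site"}, {"Kinase"}),
    ({"PFAM:kinase_dom", "PROSITE:atp_site", "PRINTS:tyr_kinase"}, {"Kinase"}),
    ({"PROSITE:atp_site"}, {"ATP-binding"}),
    ({"PFAM:kinase_dom"}, {"Kinase"}),
    ({"PROSITE:atp_site", "PRINTS:helicase"}, {"Helicase"}),
]


def score_rule(motifs, keyword, proteins):
    """Return (precision, recall) of the rule 'all motifs present -> keyword'."""
    # Keyword sets of the proteins on which the rule fires.
    fired = [kws for hits, kws in proteins if motifs <= hits]
    true_pos = sum(keyword in kws for kws in fired)
    positives = sum(keyword in kws for _, kws in proteins)
    precision = true_pos / len(fired) if fired else 0.0
    recall = true_pos / positives if positives else 0.0
    return precision, recall


def best_rule(keyword, proteins, max_size=2):
    """Exhaustively search motif combinations up to max_size for one keyword."""
    all_motifs = sorted(set().union(*(hits for hits, _ in proteins)))
    best = None
    for size in range(1, max_size + 1):
        for combo in combinations(all_motifs, size):
            precision, recall = score_rule(frozenset(combo), keyword, proteins)
            # Rank by precision, then recall, then prefer shorter rules.
            key = (precision, recall, -size)
            if best is None or key > best[0]:
                best = (key, combo, precision, recall)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    for kw in ("Kinase", "Helicase", "ATP-binding"):
        print(kw, best_rule(kw, PROTEINS))
```

On this toy set, a single motif is enough for some keywords, while for others no combination is precise. That is the kind of per-class comparison of methods the abstract refers to.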

Similar resources

Operation Sequencing Optimization in CAPP Using Hybrid Teaching-Learning Based Optimization (HTLBO)

Computer-aided process planning (CAPP) is an essential component in linking computer-aided design (CAD) and computer-aided manufacturing (CAM). Operation sequencing in CAPP is an essential activity. Each sequence of production operations which is produced in a process plan cannot be the best possible sequence every time in a changing production environment. As the complexity of the product incr...

A CAD System Framework for the Automatic Diagnosis and Annotation of Histological and Bone Marrow Images

Owing to the ever-increasing volume of medical image data in the world's medical centers and recent developments in medical imaging hardware and technology, software analysis of medical data has become necessary. Equipping medical science with intelligent tools for the diagnosis and treatment of illnesses has reduced physicians' errors and physical and financial damages. In this article we pr...

Automatic Assignment of Protein Function with Supervised Classifiers

Automatic Assignment of Protein Function with Supervised Classifiers. (August 2008) Jae Hee Jung, B.S., Dongduk Women’s University; M.S., Korea University Chair of Advisory Committee: Dr. Michael R. Thon High-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common...

Improving supervised classification accuracy using non-rigid multimodal image registration: Detecting Prostate Cancer

Computer-aided diagnosis (CAD) systems for the detection of cancer in medical images require precise labeling of training data. For magnetic resonance (MR) imaging (MRI) of the prostate, training labels define the spatial extent of prostate cancer (CaP); the most common source for these labels is expert segmentations. When ancillary data such as whole mount histology (WMH) sections, which provi...

Prototype-Driven Learning for Sequence Models

We investigate prototype-driven learning for primarily unsupervised sequence modeling. Prior knowledge is specified declaratively, by providing a few canonical examples of each target annotation label. This sparse prototype information is then propagated across a corpus using distributional similarity features in a log-linear generative model. On part-of-speech induction in English and Chinese,...

Publication date: 2004